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Converging evidence suggests that common complex diseases with 
the same or similar clinical manifestations could have different under¬ 
lying genetic etiologies. While current research interests have shifted 
toward uncovering rare variants and structural variations predispos¬ 
ing to human diseases, the impact of heterogeneity in genetic studies 
of complex diseases has been largely overlooked. Most of the exist¬ 
ing statistical methods assume the disease under investigation has a 
homogeneous genetic effect and could, therefore, have low power if 
the disease undergoes heterogeneous pathophysiological and etiolog¬ 
ical processes. In this paper, we propose a heterogeneity weighted U 
(HWU) method for association analyses considering genetic hetero¬ 
geneity. HWU can be applied to various types of phenotypes (e.g., 
binary and continuous) and is computationally efficient for high¬ 
dimensional genetic data. Through simulations, we showed the ad¬ 
vantage of HWU when the underlying genetic etiology of a disease 
was heterogeneous, as well as the robustness of HWU against differ¬ 
ent model assumptions (e.g., phenotype distributions). Using HWU, 
we conducted a genome-wide analysis of nicotine dependence from 
the Study of Addiction: Genetics and Environments (SAGE) dataset. 

The genome-wide analysis of nearly one million genetic markers took 
7 hours, identifying heterogeneous effects of two new genes (i.e., 

CYP3A5 and IKBKB) on nicotine dependence. 


1. Introduction. Benefiting from high-throughput technology and ever- 
decreasing genotyping cost, large-scale genome-wide and sequencing studies 
have become commonplace in biomedical research. From these large-scale 
studies, thousands of genetic variants have been identified as associated with 
complex human diseases, some with compelling biological plausibility for a 

‘Assistant Professor of Biostatistics, Department of Biostatistics and Epidemiology, 
University of North Texas Health Science Center, (Email: changshuai.wei@unthsc.edu) 
Trofessor of Biostatistics, Department of Epidemiology and Biostatistics, Case West¬ 
ern Reserve University. (Email: robert.elston@cwru.edu ) 

* Corresponding Author, Assosicate Professor of Biostatistics, Department of Epidemi¬ 
ology and Biostatistics, Michigan State University. (Email: qlu@epi.msu.edu) 

Keywords and phrases: High-dimensional Data, Non-parametric Statistic, Nicotine De¬ 
pendence 


1 



2 


C. WEI ET AL. 


role in the disease pathophysiology and etiology. Despite such success, for 
most complex diseases the identified genetic variants account for only a small 
proportion of the heritability. While substantial efforts have shifted toward 
finding rare variants, gene-gene/gene-environment interactions, structural 
variations, and other genetic variants accounting for the missing heritability 
[Eichler et al. (2010)], there is a considerable lack of attention being paid to 
genetic heterogeneity in the analysis of complex human diseases.We define 
genetic heterogeneity as a genetic variant having different effects on indi¬ 
viduals or on subgroups of a population (e.g., gender and ethnic groups). 
For instance, the effect size and the effect direction of the genetic vari¬ 
ant can be different according to the individuals’ genetic background, per¬ 
sonal/demographic characteristics and/or the sub-phenotype groups they 
belong to. 

Substantial evidence from a wide range of diseases suggests that complex 
diseases are characterized by remarkable genetic heterogeneity [Thornton- 
Wells et al. (2004); McClellan and King (2010); Galvan et al. (2010)] . De¬ 
spite the strong evidence of genetic heterogeneity in human disease etiology, 
investigating genetic variants with heterogeneous effects remains a great 
challenge, primarily because: i) the commonly used study designs (e.g. the 
case-control design) may not be optimal for studying heterogeneous effects; 
ii) there is a lack of prior knowledge that can be used to infer the latent 
population structure (i.e., heterogeneous subgroups in the population); iii) 
replication studies are more challenging and need to be carefully designed; 
and iv) computationally efficient and flexible statistical methods for high¬ 
dimensional data analysis, taking into account genetic heterogeneity, have 
not been well developed. Most of the existing methods assume that the 
disease under investigation is a unified phenotype with homogeneous ge¬ 
netic causes. When genetic heterogeneity is present, the current methods 
will likely yield attenuated estimates for the effects of genetic variants, lead¬ 
ing to low power of the study. 

To account for genetic heterogeneity in association analyses, we propose 
a heterogeneity weighted U, referred to as HWU. Because the new method 
is based on a weighted U statistic, it assumes no specific distribution of phe¬ 
notypes; it can be applied to both qualitative and quantitative phenotypes 
with various types of distributions. Moreover, HWU is computationally ef¬ 
ficient and has been implemented in a C-|—|- package for high-dimensional 
data analyses (https://www.msu.edu/~changsl8/software.html^HWU). 


2. Method. 
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2.1. Motivation from a Gaussian random effect model. To motivate the 
idea of the heterogeneity weighted U, we first introduce a Gaussian random 
effect model to test genetic association when considering genetic heterogene¬ 
ity. Assume the following random effect model, 

Yi = IJ. + gil3i + Si,ei ~ lV(0,o-^), 

where Yi and gi represent the phenotype and the single-locus genotype of 
individual i, respectively, gi can be coded as 0, 1, and 2 (i.e., the additive 
model), or 0 and 1 (e.g., the dominant/recessive model); is normally 
distributed, ~ ^"(0, and e* is the iid random error. Let Kij represent 

the background similarity or the latent population structure for individuals 
i and j. We assume that the more similar two individuals are, the more 
similar are their genetic effects, i.e., cov{j5i,j5j) = Kijcrl- 

We define /3 = Y = (W,--. ,y„)^, e = (ei,-- - ,en)^, G = 

{diag{gi, ■ ■ ■ ,gn)}nxn , and K = {Kijjnxn- The model can then be written 
as: y = l-t- + G(3 + e,e ^ -^(0, P ~ ^"(0, cr^K). We denote 6 = G/3, and 
rewrite the model as; T = g, + 5 + e,e N{0,a^I),6 ~ N{0,alGKG). A 
score test statistic can be formed to test the variance component = 0, 

T = Y'^GKGY, 

where Yi = {Yi — g)fa is the standardized residual under the null. We 
can partition the test statistic T into two parts, T = + 

Sr=i > where the first summation is closely related to the weighted U 
statistic introduced below. 

2.2. Heterogeneity weighted U. The Gaussian random effect model as¬ 
sumes a normal distribution. In order to consider phenotypes with various 
distributions and modes of inheritance, we develop a heterogeneity weighted 
U with rank-based U kernels and flexible weight functions. We first order the 
subjects according to their phenotypic values Yi and assign subject scores 
based on their ranks, denoted by iij, i = 1, • • • ,n. When there are ties in the 
sample, we assign the averaged rank. For example, in a case-control study 
with Nq controls (1/ = 0) and Ni cases (1/ = 1), all the controls are assigned 
a score {Nq + l)/2. The phenotypic similarity between subjects i and j can 
be defined as, 

Sig — h{RijRj')y 

where h{-,-) is a two degree mean zero symmetric kernel function (i.e., 
h{Ri,Rj) = h{Rj,Ri) and Ep{h{Ri, Rj)) = 0 ) that satisfies the finite sec¬ 
ond moment condition, EF{h?‘{Ri, Rj)) < oo, and the degenerate kernel 
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condition, var{E{h{Ri, Rj)\Rj)) = 0. In this paper, we choose h{Ri,Rj) = 
a]^{Ri — ^R){Rj — where = E{R) and = var{R). Let Gi = 
(5i,i) • • • j9i,Q) denote the multiple genetic variants for individual i. We fur¬ 
ther define a weight function to measure the genetic similarity under the 
latent population structure Kij, 

'^i,j ~ ^i,j f {Gil Gj), 

where f{Gi,Gj) represents the genetic similarity calculated based on the 
genetic variants of interest. We can then form the heterogeneity weighted 
U, referred to as HWU, 


U — 2 WijSij, 

l<i<j<n 

to evaluate the association between the phenotype and the genetic variants, 
considering the latent population structure. 

Thus HWU is a summation, over all pairs of individuals, of their phe¬ 
notypic similarities weighted by their genetic similarities. Under the null 
hypothesis of no association, the phenotypic similarity is unrelated to the ge¬ 
netic similarity. Because the phenotypic similarity has mean 0 (i.e., EF{Ri, Rj) 
0), the expectation of HWU is 0. Under the alternative, the phenotypic sim¬ 
ilarities should increase as the genetic similarities increases. The positive 
phenotypic similarities are more heavily weighted and the negative pheno¬ 
typic similarities are more lightly weighted, leading to a positive value of 
HWU under the alternative. 

2.3. Asymptotic distribution of heterogeneity weighted U. To assess the 
signihcance of the association, a permutation test can be used to calculate 
a p-value for HWU. However, for high-dimensional data, the permutation 
test could be computationally intensive. Therefore, we derive the asymptotic 
distribution of HWU under the null hypothesis. 

The asymptotic properties of the un-weighted U statistic (i.e., Wij = 1) 
are well established [Hoeffding (1948); Serfling (1981)]. When the kernel is 
non-degenerate ( var{E{h{Ri, Rj)\Rj)) > 0 ), the limiting distribution is 
normal. When the kernel is degenerate ( var{E{h{Ri, Rj)\Rj)) = 0 ), the 
limiting distribution is a sum of independent chi-square variables. However, 
the limiting distribution of the weighted U statistic depends on both the 
weight function and the kernel function [O’Neil and Redner (1993)]. Because 
non-normality also occurs for a non-degenerate kernel with certain weight 
functions, we use the degenerate kernel for HWU to obtain a unihed form 
of limiting distribution, as shown in the following derivation. 
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We first expand the kernel function as the sum of products of its 

eigenfunctions. Let {at} and {¥?*(•)} denote the eigenvalues and the corre¬ 
sponding ortho-normal eigenfunctions of the kernel. We can write as 

h{Ri, Rj) = ^t^t{Ri)^s{Rj) ) and the weighted U as, 

OO 

U = J2wi,j'^am{Ri)^s{Rj), 
i¥=j i=l 


where 


E{ipt{Ri)(ps{Rj)) 


1, t = s and i = j 
0, otherwise. 


By exchanging the two summations, (Yl'^iOit'^i^jWi^j(pt{Ri)ips{Rj)), the 
weighted U statistic is an infinite sum of quadratic forms and can be approx¬ 
imated by a linear combination of chi-square random variables [Dewet and 
Venter (1973); Shieh et al. (1994)]. Letting W = {wij}nxn be the weight 
matrix with all diagonal element equal to 0, the limiting distribution can be 
written as 

OO n 

- 1 ), 

t=l S=1 


where {A^} are the eigenvalues of the weight matrix and {xi are iid chi- 
square random variables with 1 df. In this paper, we use a cross product 
kernel, h{Ri,Rj) = a^{Ri — ^B){Rj — hr). In this case, the expansion 
of /i(-,-) can be simplified to h{Ri,Rj) = aiipi{Ri)ipi{Rj), where ai = 1 
and ^pi{R) = (t])^{R — hr)- Using this representation and the fact that 
= 0, the limiting distribution can be simplified to U e:=iA.xL. 
We also note that the parameters fiR and are unknown and need to be 
estimated from the data, which influences the limiting distribution of HWU 
[Dewet and Randles (1987); Shieh (1997)]. Taking the parameter estimation 
into account, the limiting distribution can be expressed as a weighted sum 
of independent chi-squared variables, U Er=iAi,sAi,s (Appendix A.l), 
where {Ai,s} are the eigenvalues of the matrix (/ — J)W{I — J) , in which 
I is an identity matrix and J is a matrix with all elements equal to 1/n. 

The HWU described above can also be modified to allow for covariate ad¬ 
justment. Suppose Znxp = (1, -21) • • • ) ^p) is the covariate matrix. In the cross 
product kernel of HWU, we can calculate the estimators of hr and cr^ as 
Hr = PR and (t|j = {R— hr)'^ {R— hr)/ {n—p—1), where P = Z{Z'^Z)~^Z'^. 
The limiting distribution can then be written as U r-u Es=i > where 

{A^ g} are the eigenvalues of the matrix (/ — P)W{1 — P). 
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Davies’ method [Davies (1980)] can be used to calculate the p-value 
for the association test. When the calculation involves large matrix eigen- 
decomposition, we use the state-of-the-art algorithm nu-TRLan [Wu and 
Simon (2000)] to improve the computational efficiency. 

2.4. Weighting schemes. The weight function comprises two components, 
Kij and f{Gi,Gj) . Kij measures the latent population structure, which 
could be inferred from related covariates. Depending on the type of data, 
different functions can be used to calculate Kij. For instance, we can apply 
the genome-wide averaged IBS function on GWAS data and the genome¬ 
wide weighted average IBS (WIBS) function on sequencing data to calcu¬ 
late Kij [Astle and Balding (2009)]. For environmental covariates, we can 
calculate Kij based on Euclidian distance [Jiang and Zhang (2011)]. Given 
environmental covariates, we first standardize each covariate according to its 
mean and standard deviation, denoted by Xd {xd = [xd^i, ■ ■ ■ ,Xd,n)'^ , d = 
1, 2, • • • , D ), and then calculate Kij = exp{—{xi — Xj)R{xi — XjY'), where R 
is used to reflect the relative importance (e.g., R = {diag{uJd)}DxD in which 
ujd measures the importance) or inner correlation (e.g., i? = (^ 
of the covariates. 

f{Gi,Gj) measures the genetic similarity. For a single-locus model, we 
can use the cross product f{Gi,Gj) = f{gi,gj) = giQj when the effect is 
additive. Otherwise, we can use f{Gi,Gj) = f{gi,gj) = l{gi = gj) for an 
unspecified mode of inheritance, where l(-) is the indicator function. The 
above measurements can be easily extended to handle Q multiple markers 
by using f{Gi, Gj) = Y,q=i 9q,i9q,j- 

The weight function Wij can also be specified for different purposes. For 
instance, if we choose Wij = f{Gi,Gj) (i.e., Kij = 1), then the weighted U 
tests the association without consideration of genetic heterogeneity. We refer 
to this statistic as the non-heterogeneity weighted U (NHWU). Furthermore, 
we can construct a statistic to test the presence of the heterogeneity effect, 
referred to as the pure-heterogeneity weighted U (PHWU), by setting w*j = 
{fiij - h)f{Gi, Gj), where R = ^ Yh=i Ej=i 

3. Result. 

3.1. Simulations. In simulation I and simulation II, we simulated various 
cases of genetic heterogeneity and compared the proposed HWU test with 
two other tests, NHWU and the likelihood ratio test using the conventional 
generalized linear model (GLM). In simulation III, we investigated the ro¬ 
bustness of HWU to non-normal distributions and mis-specified weight func¬ 
tions. In all sets of simulations, unless otherwise specihed we used Euclidian- 
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distance-based Kij by setting R = I and cross-product-based f{gi,gj) to 
form the weight function. For each simulation setting, we simulated 1000 
replicate datasets, each having a sample size of 1000. Power and type 1 er¬ 
ror of the methods were calculated based on the proportion of p-values in 
the 1000 replicates smaller than or equal to 0.05. 

3.1.1. Simulation I. In this simulation, we assumed two sub-populations, 
and considered both continuous and binary phenotypes. We simulated bi¬ 
nary phenotypes using the logistic model, 


= 1)) = ^ + gi,j{i)Pi, 

where i and j{i) represented respectively the z-th subpopulation and j-th 
individual in the i-th sub-population. Additionally, we introduced a covari¬ 
ate = Oi + ~ from which we infer the latent 

sub-populations. Continuous phenotypes were simulated similarly by using 
a linear regression model. The value of the regression coefficient /3i for dif¬ 
ferent models was listed in Table 1, while the details of the simulation were 
described in Supplementary Appendix A. 

No substantial inflation of type I error was detected for any of the three 
methods (Table 1). In the presence of genetic heterogeneity (i.e., Tl, T3, 
and T4 in Table 1), HWU outperformed NHWU and GLM, especially when 
the genetic effects for the two sub-populations were in the opposite direction 
(i.e., Tl). In such a case, NHWU and GLM could barely detect any genetic 
effect, while HWU had high statistical power to detect the association. In 
the absence of genetic heterogeneity (i.e., T2 in Table 1), HWU remained 
comparable in performance to NHWU and GLM. We also noted that, in the 
absence of genetic heterogeneity (i.e., T2), the non-parametric NHWU had 
almost identical power to GLM. 

We also investigated the performance of the three methods when the un¬ 
derlying phenotype distribution and the modes of inheritance were unknown 
(Supplementary Simulation I). Overall, HWU outperformed the other two 
methods. In particular, when the phenotype was non-normal, both HWU 
and NHWU had higher power than GLM (Supplementary Table SI). By us¬ 
ing f{gi,gj) = l{gi = gj), HWU was robust to the disease model when the 
mode of inheritance was unknown, e.g., heterozygote effect (Supplementary 
Table S2 and S3). 

3.1.2. Simulation II. In simulation H, we used the same simulation model 
as in simulation I, but considered a more complicated latent population 
structure by increasing the number of sub-populations to 20, and sampling 


Table 1 

Type I error and power eomparison of three methods when there are two heterogeneous 

sub-populations 


Binary Phenotype Continuous Phenotype 

Model^ Effect^ Type I error/Power Effect Type I error/Power 



Pi 

P2 

HWU 

NHWU 

GLM 

Pi 

02 

HWU 

NHWU 

GLM 

Null 

0 

0 

0.046 

0.051 

0.051 

0 

0 

0.052 

0.049 

0.049 

Tl 

-0.1 

0.1 

0.085 

0.07 

0.07 

-0.1 

0.1 

0.241 

0.062 

0.07 


-0.3 

0.3 

0.455 

0.069 

0.068 

-0.3 

0.3 

0.972 

0.157 

0.173 


-0.5 

0.5 

0.899 

0.084 

0.084 

-0.5 

0.5 

1 

0.376 

0.417 

T2 

0.1 

0.1 

0.122 

0.149 

0.15 

0.1 

0.1 

0.284 

0.368 

0.384 


0.3 

0.3 

0.601 

0.743 

0.75 

0.3 

0.3 

0.999 

1 

1 


0.5 

0.5 

0.978 

0.993 

0.993 

0.5 

0.5 

1 

1 

1 

T3 

0 

0.2 

0.107 

0.102 

0.102 

0 

0.2 

0.32 

0.273 

0.282 


0 

0.4 

0.378 

0.307 

0.31 

0 

0.4 

0.902 

0.767 

0.803 


0 

0.6 

0.718 

0.582 

0.586 

0 

0.6 

0.999 

0.978 

0.984 

T4 

-0.1 

0.3 

0.239 

0.091 

0.092 

-0.1 

0.3 

0.718 

0.161 

0.18 


0.1 

0.3 

0.304 

0.377 

0.381 

0.1 

0.3 

0.822 

0.89 

0.906 


-0.3 

0.5 

0.72 

0.069 

0.069 

-0.3 

0.5 

0.997 

0.059 

0.076 


^Various scenarios of heterogeneity were considered in the simulation, including no 
genetic effect for both sub-populations (Null), the same effect size but with different 
directions (Tl), the same effect size with the same direction (T2), no genetic effect for 
one sub-population but having a genetic effect for the other (T3), and different effect 
sizes with the same or different directions (T4). 

^Single-locus effects for the 2 sub-populations, where the effect for the sub-population 1 
denoted by /3i and the effect for the sub-population 2 denoted by /32- 
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Fig 1. Power comparison of three methods for a binary phenotype with 20 heterogeneous 
sub-populations 


PB = 0 Hp = 0.1 P(5 = 0.2 |j0 = O.3 



*the genetic effect, fdi, for the i-th sub-population was sampled from a uniform 
distribution with mean and variance 


Fig 2. Power comparison of three methods for a continuous phenotype with 20 heteroge 


neous sub-populations 
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*the genetic effect, fdi, for the i-th sub-population was sampled from a uniform 
distribution with mean /x ,9 and variance 
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Pi ( i = 1, 2, • • • ,20 ) from a uniform distribution with mean and vari¬ 
ance We simulated 25 covariates, ~ -^(0,^c)) 

( d = 1,2,--- ,25 ), to generate the latent population structure (Supple¬ 
mentary Appendix B). No substantial inflation of type I error was detected 
for any of the three methods at the 0.05 level (Supplementary Table S4). 
Through simulation, we demonstrated that HWU outperformed NHWU and 
GLM for both binary (Figure 1) and continuous (Figure 2) phenotypes. 
In the presence of genetic heterogeneity (i.e.,when is large), HWU 

attained higher power than NHWU and GLM. When the genetic hetero¬ 
geneity was negligible (i.e., when is small), HWU had comparable 

performance to NHWU and GLM. When the average genetic effect ( /r /3 ) 
increased, all three methods gained power. Nevertheless, when the variance 
of the genetic effect { crp ) increased, only HWU gained substantial increase 
in power. We also investigated the performance of HWU when the covariates 
could not accurately infer the latent population structure. For such purpose, 
we investigated the power of HWU as the noise parameter cj^ changed. The 
result showed that the power of HWU decreased as the “noise” increased 
(Supplementary Table S5). 

In practice, the nature of the latent population structure may not be “cat¬ 
egorical” . Therefore, we also simulated genetic effects using a random effect 
model, where effects were different for each subject (Supplementary Simula¬ 
tion H). The three methods had comparable power when the genetic hetero¬ 
geneity was negligible. Nevertheless, as the genetic heterogeneity increased, 
there was a clear advantage of HWU over NHWU and GLM (Supplementary 
Figures SI and S2). 

3.1.3. Simulation III. In simulation HI, we first investigated the robust¬ 
ness of HWU against different non-normal phenotype distributions. In order 
to separate the influence of heterogeneity and phenotype distribution, we 
compared HWU with its “parametric alternative”, the variance component 
score test (VCscore), instead of GLM. We simulated the phenotype using a 
random effect model, 


yi = ^1 + Zia + giPi -h £*, e* ~ F, 

where Zj denotes covariates for subject i, a denotes covariate effects and F 
followed a non-normal distribution (Supplementary Appendix C). We sim¬ 
ulated three types of non-normal distribution for F, 1) t ditributions with 
df = 2, 2) Gauchy distribution, and 3) a mixture of normal and chi-squared 
distribution. For each distribution, we simulated model with confounding ef¬ 
fects and without confounding effect, where confounding effect is simulated 
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Table 2 

Type I error comparisons of HWU and VCscore under non-normal distributions 


Confounding Effect 

Model ■ 

Distribution 

Mixture 

tdf=2 

Cauchy 

No 

HWU 

0.041 

0.057 

0.057 

VCscore 

0.039 

0.070 

0.095 

Yes 

HWU 

0.053 

0.060 

0.062 

VCscore 

0.059 

0.146 

0.250 


*the mixture distribution follows aXd/=i + (1 ~ a)N{5, 1), where a ~ Bernoulli{0.6). 


Table 3 

Performance of HWU with a mis-specified weight function 


Component 

Mis-specification'*^ 

A/rnrloU 

Method^ 

Mis 

True 


HWU(mis) 

HWU(true) 




Null 

0.047 

0.049 

f{gi,gj) ■ 

g^gj 

Alt 

0.459 

0.481 


Hgi = ft) 


Null 

0.048 

0.052 


gigj 

Alt 

0.451 

0.51 



Ti{xi, Xj) 

Null 

0.05 

0.058 


jpXiXj 

Alt 

0.174 

0.505 

T){xi,Xj) 


Null 

0.046 

0.053 


D ’d:i Xj 

Alt 

0.072 

0.515 


^“Mis” represents the misspeficied f{gi,gj) or Kij when analyzing simulated data, while 
“True” represent the true f{gi,gj) or Kij in the corresponding simulation setting. Here, 
£)(•, •) represents the euclidian distance based weight, i.e., T)(gi,gj} = exp{—{gi — gj)^) 
and Tl{xi,Xj) = exp{-j^ 

^The error distribution was set as t distribution with df = 2. “Null” represents the null 
model with fig = 0 and (j| = 0; “Alt” represents the heterogeneous effect model with 

Up = 0 and cr| = 0.5. 

^HWU(mis) represents the HWU model with a mis-specified weight function, while 
HWU(true) represents the HWU model with the true weight function. 


by generating Zj that is correlated with gi. Meanwhile, Zj is also correlated 
with Ui since a 0. We included Z in the analysis for both HWU and VC¬ 
score, and summarize the Type I errors in Table 2. No substantial inflation of 
type I error was detected for HWU for 3 non-normal distributions, regard¬ 
less of whether there were confounding effects. VCscore is robust against 
mixture of normal and chi-squared distribution, but have inflated type I er¬ 
ror for heavy tailed distribution (e.g., Cauchy distribution). If we did not 
include Z in the analysis, both methods showed inflated type I error when 
there were confounding effects(Supplementary Table S6). Further investiga¬ 
tions on power performance showed slightly more advantage of HWU over 
VCscore for non-normal distributions (Supplementary Table S7). 

We also investigated the performance of HWU when the weight function 
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Table 4 

Top 10 nicotine dependence associated SNPs from the GWAS analysis considering 

gender heterogeneity 


Name 

Chr 

Position 

Gene 


p-value 

HWU 

NHWU 

rsl7078660 

3 

46160432 

NA, near FLTIPI 

9.98 X 

lO""* 

0.017 

MitoA15302G 

26 

15302 

NA 

1.69 X 

10"® 

0.688 

rs7753843 

6 

67055504 

NA 

1.86 X 

10"® 

0.571 

rsl0493279 

1 

60368804 

NA, near Clorf87 

1.88 X 

10"® 

0.014 

rs4560769 

8 

42259961 

IKBKB 

1.93 X 

10"® 

0.062 

rs9694958 

8 

42275203 

IKBKB 

2.82 X 

10"® 

0.122 

rs776746 

7 

99108475 

CYP3A5 

2.91 X 

10"® 

0.241 

rs4646437 

7 

99203019 

CYP3A4 

4.03 X 

10"® 

0.512 

rs9694574 

8 

42279609 

IKBKB 

4.63 X 

10"® 

0.144 

rs4646457 

7 

99083016 

ZSCAN25 

4.74 X 

10"® 

0.138 


was mis-specified (Table 3). In this simulation, we considered 4 different 
scenarios, either with mis-specified f{gi,gj) or mis-specified Kjj. Type I 
error rates were well controlled when the weight function was mis-specified. 
However, we found the power of HWU with a mis-specified weight function 
was lower than that with a correct weight function, especially when Kij was 
mis-specified (Table 3). 

3.2. Genome-wide association analysis of Nicotine Dependence. We ap¬ 
plied our methods to the Genome-wide association study (GWAS) dataset 
from the Study of Addiction: Genetics and Environments (SAGE). The 
SAGE is one of the largest and most comprehensive case-control studies 
conducted to date aimed at discovering new genetic variants contributing 
to addiction. We analyzed the number of cigarettes smoked per day, catego¬ 
rized into 4 classes (0 for less than 10 cigarettes, 1 for ll-to-20 cigarettes, 
2 for 21-to-30 cigarettes, and 3 for more than 31 cigarettes). Prior to the 
statistical analysis, we reassessed the quality of the genotype data. After 
undertaking a careful quality control process (i.e., removing samples with 
missing phenotype data and low-quality genetic markers), 2845 subjects and 
949,658 single-nucleotide polymorphisms (SNPs) remained for the analysis. 
The SAGE comprises samples from both Caucasian and African-American 
populations. To make the association analysis robust against confounding 
effects, we adjusted for the first 20 principal components from the available 
genome-wide genetic markers, as well as gender and race, in the analysis. 

Considering that the etiology of nicotine dependence has been shown to 
be heterogeneous for gender [Li et al. (2003)], we used gender to infer the 
latent population structure and assumed an additive effect to compute . Us- 
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ing HWU, the genome-wide scanning of 949,658 SNPs on the SAGE dataset 
was completed in about 7 hours by parallel computation on 19 cores. The 
top 10 SNPs having the strongest association with nicotine dependence are 
listed in Table 4. Among the 10 SNPs, 3 SNPs (i.e., rs4560769, rs9694574, 
and rs9694958) are located within the gene IKBKB, while another 3 SNPs 
(i.e., rs4646437, rs4646457, rs776746) are located within or near the gene 
CYP3A5. The 3 SNPs related to gene IKBKB are in high linkage disequi¬ 
librium (LD), with the estimated correlation ranging from 0.736 to 0.853. 
The highest association signal was from rs4560769 (p-value= 1.93 x 10“®). 
The 3 SNPs related to gene CYP3A5 were also in high LD (correlation 
from 0.781 to 0.913), among which rs4646437 had the strongest association 
with nicotine dependence (p-value= 2.91 x 10“®). To evaluate the sensitivity 
of the results, we performed association tests using other weight functions 
(Table 4). Using a homogeneity weight Wij = gigj (NHWU), none of the 10 
SNPs had a p-value smaller than 0.01. The difference between HWU and 
NHWU indicated heterogeneous effects of the two genes on nicotine depen¬ 
dence in males and females. Additional stratified analysis by analyzing males 
and females separately also suggested this heterogeneous effect of the two 
genes in males and females (Supplementary Real Data Analysis). In addition 
to gender, we also investigated potential genetic heterogeneity due to dif¬ 
ferent ethnic and genetic backgrounds. In these analyses, we considered the 
same covariates as those used in the gender heterogeneity analysis. However, 
the results suggested there was no strong evidence of genetic heterogeneity 
due to different ethnic and genetic backgrounds (Supplementary Real Data 
Analysis). 

4. Discussion. In recent years, U-statistic based methods have been 
gaining popularity in genetic association studies due to their robustness 
and flexibility [Schaid et al. (2005); Zhang et al. (2010)]. Yet, few methods 
have been developed to model genetic heterogeneity, especially under the 
weighted U framework. In this paper, we have proposed a flexible and com¬ 
putationally efficient method, HWU, for high-dimensional genetic associa¬ 
tion analyses allowing for genetic heterogeneity. With HWU, we were able to 
integrate the latent population structure (inferred from genetic background 
or environmental covariates) into a weight function and test heterogeneous 
effects without stratifying the sample. Simulation studies were conducted 
to compare the power of the proposed HWU method with methods that do 
not model genetic heterogeneity (i.e., NHWU and GLM). In the presence of 
genetic heterogeneity, HWU attained higher power than NHWU and GLM. 
In the absence of genetic heterogeneity, HWU still had comparable perfor- 
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mance to NHWU and GLM. Unlike conventional methods, such as GLM, 
our method was developed based on a nonparametric U statistic, and there¬ 
fore offers robust performance when the underlying phenotype distribution 
and mode of inheritance are unknown. 

In HWU, we use genome profiles or environmental covariates to build 
the background similarity (i.e., the latent population structure Kij ) and 
combine it with the genetic similarity to form the weight function f{Gi, Gj). 
We then evaluate its relationship with a phenotype by using a weighted U 
statistic. Our method is different from testing an interaction effect. The key 
difference is that, for HWU, we assume there is a latent population structure 
that acts in some joint fashion with the genetic variants, while in the usual 
interaction effect model the genetic variants are assumed to interact with 
known variables. Furthermore, our test has fewer degrees of freedom than 
usual interaction tests. HWU is based on the idea that the more similar two 
subjects are, the more similar are their genetic effects. The idea of relating 
phenotype similarity to genotype similarity is not new. For example, Tzeng 
et.al proposed a gene-trait similarity regression for multi-locus association 
analysis [Tzeng and Zhang (2007)]. However, their method is based on the 
usual regression framework and does not consider genetic heterogeneity. 

In this paper, we focus on a single-locus test with consideration of genetic 
heterogeneity and assume an additive model. By modifying the weight func¬ 
tion, HWU can easily be extended to model a multi-locus effect and other 
modes of inheritance (e.g., dominant/recessive effects). The weight function 
also offers flexibility for constructing latent population structure. Various 
similarity-based or distance-based functions can be applied to informative 
environmental and genetic covariates to infer the latent population struc¬ 
ture. Although type I error is generally controlled for a variety of weight 
functions, the choice of an appropriate function to construct the latent pop¬ 
ulation structure could impact the power of HWU. In this article, we suggest 
a Euclidian-distance based function, Kij = exp{—{xi — Xj)R{xi — Xj)'^), in 
which prior knowledge can be incorporated for potential power improve¬ 
ment. Nevertheless, a cross product kernel (i.e., Kij = jyXixJ) can also be 
used if the underlying model favors linearity. In the scenario where multi¬ 
ple functions might be used to construct the latent population structure, 
the optimal function could be chosen by using a similar approach to that 
proposed by Lee et al. (2012). 

Another advantage of our method is its computational efficiency. For the 
analysis of high-dimensional data, we derived the asymptotic distribution of 
the weighted U statistic and optimized the computational algorithm (e.g. 
using efficient eigen-decomposition). The genome-wide analysis of 949,658 
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SNPs took 7 hours and identified two genes, IKBKB and CYP3A5. Al¬ 
though our analysis suggests that these two genes are associated with nico¬ 
tine dependence and have heterogeneous effects according to gender, further 
study and biological experiments are needed to confirm the association and 
to further investigate the potential function of these two genes in nicotine 
dependence. 


APPENDIX A 

A.l. Asymtotic distribution of HWU with parameter estima¬ 
tion. As showed in the main text, the limiting distribution of the weighted 
U with a cross product kernel can be simplified to U r-u E”=i ^sXls- Taking 
the parameter estimation into account [Dewet and Randles (1987); Shieh 
(1997)] , the limiting distribution becomes: 

n 

U ~ ^ Xs{(l>s + Cs4>o)‘^, 

s=l 

where {A^} are the eigenvalues from the eigen-decomposition of VP = BAB^, 
in which A = {diag{Xs)}nxn and B = {bij}nxn- {</*«} are i.i.d. standard 
normal random variables. 4>o is also standard normal random variable with 
cov{4>s, (po) = Cs, where Cg is defined as c* = ^ Yji=i bi,s- 

Let 7 = {(pi + ci(po,--- ,(j)n + Cn(j)o) be a random vector, where 7 = 
MVN{0, Tinxn)- Letting I be the n x n identity matrix and J be the n x n 
matrix with all elements equal to 1/n, we can easily show that S = I—B^JB 
and SE = S. Letting ^ be a random vector, ^ ~ MVN{0, Inxn), we have 
SC ~ MVN{0, S) and 

n 

^ Xs{(j)s + CgCpof = 

s=l 

= C'^SASC 

= ^^B^{BYB^)BAB^{BYB^)B^ 

= fB'^{I - J)W{I - J)B^. 

Because B^ ~ MVN{0, Inxn), tbe limiting distribution of the weighted U 
is the weighted sum of independent chi-squares, U r-u Us=i M,sxls, where 
{Ai^s} are the values of the matrix (I — J)W(I — J). 
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